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Tuberculosis (TB) is an air-borne infectious diseases caused by M. tuberculosis 
bacteria that primarily affects human lungs. Existing vaccine does not work 


well due to the evolvement and latent movement of this bacteria. Develop- 
ing an effective vaccine to combat Tuberculosis is very difficult as the inter- 
action between the bacteria and human immune system is not fully understood. 
With recent advancement of transcriptome profiling analysis, long noncoding ri- 
bonukleat acids (IncRNAs) are found to be widely expressed in immune system. 
However, the role of IncRNAs is still not been widely explored in understanding 
human immune response to TB infection. In this paper, we propose a general 
framework for predicting IncRNAs being expressed in human dendritic cells. By 
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IncRNAs incorporating deep learning method with RNA-seq data analysis, we intend to 
RNA-seq expression analysis identify and characterize the IncRNAs found in dendritic cells from two groups 
Tuberculosis of TB resistant patients through their RNA-seq expression data. 
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1. INTRODUCTION 

Tuberculosis (TB) is one of the most deadly contagious disease in the world that caused by antibiotic 
resistant bacteria known as Mycobacterium tuberculosis (Mtb). It primarily affects human lung and could 
travel through the bloodstream to infect other part of human body. Even worse, the bacteria could travel to the 
meninges, which are the membranes surrounding the brain and spinal cord. Since the function of meninges 
is to protect human central nervous system, the infected meninges causes a rise in pressure within the skull, 
resulting in nerve and brain tissue damage, which is often severe. This life-threatening condition is known as 
meningeal tuberculosis (TB meningitis). It has been reported that 10-20% of the TB meningitis patients will 
suffer long-term after- effect such as severe brain damage, epilepsy, paralysis, hearing loss, and blindness [I]. 

The human immune system ables to protect human body against diseases by identifying, attacking and 
destroying threats from viruses, bacteria and parasites when function properly. As for TB, the immune response 
of a patient is critical whereby it could either help the body to fight the progression of the disease or it could 
exacerbate the bacteria infection when there is involvement of certain key molecules. Even worse, tuberculosis 
is said to co-evolve with human and the ability of Mtb to manipulate human immune system to destruct lung 
tissue has made it an ultimate pathogen [2]. M. bovis Bacillus Calmette-Guérin (BCG), the vaccine that was 
introduced and being widely used since 1921 does not work well in protecting human against TB due to 
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evolvement and latent movement of Mtb. Developing an effective vaccine has been very challenging and 
difficult since the bacteria ables to evade immune system attack by co-opting the mechanisms of the immune 
system itself to its own advantage [3], [4]. The immune response against Mtb involves a network of innate 
and adaptive immune responses, where dendritic cells (DCs) are the key cells that bridge them. DCs is one 
of the main types of professional antigen-presenting cells (APCs) of the immune system [5]. Though the Mtb 
primarily resides in DCs and ables to interfere with their functions as it has the ability to impair host innate 
and adaptive immune responses, their interactions are less well understood [6]. Furthermore, existing research 
findings shows that the interaction of DCs with Mtb is contradictory [7]. 

Long non-coding ribonukleat acids (IncRNAs) are RNA molecules with length exceeding more than 
200 nucleotides, which do not code for proteins. Although IncRNAs are less understood, they do have crucial 
roles in diverse biological, pathological processes, and could cause prominent implications to various human 
diseases. They are expressed in a tissue-specific context and responsible for regulating transcriptional control. 
Recent studies show that they are functionally associated with various cancers [8]-[10] and immune-mediated 
diseases [11], [12]. They are the regulators of various immune function, where they have large effects on 
adaptive and innate immune system [6], [13], [14]. Microarray technology and traditional wet-lab exper- 
iments had been applied for many years to uncover IncRNAs differential expression in patients infected 
with tuberculosis. However, RNA-seq is proven to provide better estimates of transcript expressions [17]. As 
shown in Figure[I| we propose a general framework for predicting IncRNAs being expressed in human DCs. 
We intend to conduct RNA-seq expression analysis with convulational neural networks (CNNs), a well-known 
deep learning (DL) technique to reveal the identification and characterization of IncRNAs found in DCs asso- 
ciated to TB infection for TB resistant patients who were identified to have non-infected and infected states as 
discussed in [18]. 
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Figure 1. Schematic illustration framework of discovering IncRNAs being expressed in human DCs from 
RNA-seq data 


2. METHOD 
2.1. RNA-seq data 

The RNA-seq datasets of resistant patients are obtained in a form of FASTQ file format from sequence 
read archive (SRA) database. They are single-end reads of next generation sequencing (NGS) library, using 
Illumina HISeq 4,000 instrument, which sequencing was performed by Gilad Lab, University of Chicago [18]. 
The experiment of transcriptomic response of DCs towards Mtb infection will take into consideration the in- 
fected and non-infected group of patients. There are 26 samples altogether whereby 13 samples from infected 
patients, and another 13 samples are from non-infected patients. Information on raw RNA-Seq datasets used in 
this study are shown in Tables [I]and[2|respectively. 
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Table 1. Infected group Table 2. Non-infected group 
Sample size (bp) SRA No Sample size (bp) SRA No 
1.92G SRR5206792 1.42G SRR5206795 
8.50G SRR5206794 2.10G SRR5206803 
1.49G SRR5206796 2.50G SRR5206805 
1.36G SRR5206804 1.60G SRR5206807 
1.29G SRR5206806 1.67G SRR5206809 
1.82G SRR5206808 11.45G SRR5206811 
4.84G SRR5206810 2.63G SRR5206813 
2.17G SRR5206812 2.66G SRR5206815 
1.28G SRR52068 14 2.00G SRR5206817 
1.68G SRR5206816 2.30G SRR5206819 
10.39G SRR5206818 1.94G SRR5206821 
1.75G SRR5206820 1.43G SRR5206823 
1.5G SRR5206826 1.41G SRR5206827 


2.2. Data preprocessing and quality control 

Initial quality control (QC) assessment will be performed on all data samples as an effort of checking 
whether these data has any problems before continuing further analysis. This process will be done using 
FASTQC (version 0.11.9) software, a simple graphical user interface (GUD) tool developed in Java by Babraham 
Bioinformatics group at the Babraham Institute. In a form of graph and tables (QC report), this tool provides a 
quick access and overview on problematic areas of raw sequence data coming from high throughput pipelines 
{19}. It has 12 analysis modules, which meet the quality control standard for raw reads. 

FastQC program generates a QC report, which lead to the choice of preprocessing steps that need to 
be undertaken in order to fix the identified issues. A read trimming and filtering tool optimized for Iumina 
NGS data known as Trimmomatic will be utilized to discard low-quality reads, trim adapter sequences and 
eliminate poor quality bases. In this work, we will use Trimmomatic to trim effected data samples that contains 
error in ’per base sequence content” module as reported by QC report. After the effected reads were trimmed, 
FastQC will be run again to assess the quality of the trimmed reads. The QC report will be checked again to 
make sure that all the reads that previously have issue with per base sequence content are error free. Once the 
reads satisfied all the quality requirements, we will consider that these reads are good to go for the next step of 
RNA-seq data processes and analysis. 


2.3. Reference-based RNA-seq read alignment 

To infer which transcripts are expressed, identify genomic positions or estimate where the reads origi- 
nated from, the sample RNA-Seq reads need to be aligned against reference genome. In this work, hierarchical 
indexing for spliced alignment of transcripts 2 (HISAT2) [21], a fast and sensitive graph-based alignment pro- 
gram, which is developed to map NGS reads to a single or population of human genomes will be used to 
align these reads against the selected human reference genome. HISAT2 ables to produce higher accuracy of 
sequencing reads alignment compared to original HISAT system as it incorporates algorithmic improve- 
ments, where a hierarchical graph FM index (HGFM) is applied [23}. 

The strain of human reference genome used in this project is genome research consortium human 
build 38 (GRCh38). In order to run HISAT2, the reference genome need to be built in a form of indexes. Since 
the GRCh38 indexes are already available in HISAT2 website, we downloaded the HGFM index for reference 
plus transcripts directly from their website. These indexes use ensembl gene annotations, where it has many 
more transcripts compared to reference sequence (RefSeq) annotations [23]. The output of HISAT2 aligner 
is in sequence alignment map (SAM) format. Using Samtools, the SAM files will be sorted and converted 
into binary alignment map (BAM), which stores the same data but in a compressed binary representation for 
improved performance [24]. These BAM files will then be used as the input files in transcriptome assembly. 


2.4. Transcriptome assembly 

RNA-seq reads need to be reconstructed into a full length transcripts to allow for gene expression 
studies. For this purpose, we will be using StringTie, a transcript assembler and quantification tool for RNA- 
Seq, which was also developed by central for computational Biology (CCB) of John Hopkins University. This 
tool uses genome-guided transcriptome assembly together with de novo genome assembly approaches to im- 
prove transcript assembly. It applies network flow algorithm to estimate expression level for each transcripts 
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[25]. StringTie is claimed to be better than other leading assembler such as cufflinks as it could produce more 
complete and accurate reconstructions of genes and better expression level estimation [26], [27]. 

Each alignment files produced by HISAT2, which then will be converted into BAM format are as- 
sembled using StringTie2. StringTie2 is the latest release of StringTie program, which include a new method 
in better high rate error handling and has the ability to work with full length super-reads that brings to bet- 
ter quality of short-reads assemblies [27]. The assembly process requires for reference genome annotation in 
.gtf format. Therefore, we downloaded human GRCh38 (version 21) comprehensive gene annotation file from 
genome ENCyclopedia Of DNA elements (GENCODE) online database for this purpose. 


2.5. Data filtering 

Using in-house scripts, assembled transcripts will need to be checked and filtered to meet only criteria 
of a IncRNA.The selection criteria includes to select only transcripts with nucleotide sequence more than 200 
base pairs, minimum coverage is at least 3x and the number of exon should be more than 2 as suggested in 
[29], (30). All transcripts that do not meet these criteria will be discarded. Then, the IncRNA transcripts from 
the 26 samples will be merged into two different files according to patients group. 


2.6. Differential expression analysis 

One of the key steps in analyzing RNA-seq data is to perform the genes differential expression analysis 
between different biological conditions, where the summarized data will be assessed by statistical models [17]. 
To identify and quantify the changes in expression levels between the two groups of TB resistant patients, 
DEseq2 bioconductor software package will be used as the statistical tool to perform differential analysis of 
count data. As an improved package of DESeq [BI], DESeq2 estimates and perform statistical inference on 
differential data based on negative binomial distribution. Using shrinkage estimators for the dispersion and 
fold changes in differential expression analysis allows DEseq?2 to offer a sound, consistent performance and 
statistically well-founded solution to the wide dynamic range of RNA-seq experiments [32]. 


2.7. Deep learning method 

In predicting non coding RNAs, it had been proven that the prediction performance of tools that 
apply deep learning methods exceeded other traditional machine learning methods in terms of identification 
reliability, ease of use and ability to utilize features not incorporated in the current knowledge [33], [34]. 


2.7.1. Convulational neural networks 

The considerable success of CNNs in various visions and imaging tasks has given significant impact 
to nearly all scientific fields including bioinformatics. Alzubaidi et al. [35], CNNs is the most utilized deep 
learning network type that could automatically identifies relevant features without any human intervention and 
considered to be more powerful than recurrent neural networks (RNNs), another well-known deep learning 
algorithm capable of processing sequential data. Therefore, as shown in Figure[]] CNNs technique is proposed 
to be implemented in identifying and classifying between IncRNAs and mRNAs being expressed in human 
DCs. We intend to explore the capability of CNNs to extract information from one-dimensional biological 
sequences data as discussed by [36], [37]. 

We will enhance the IncRNAs prediction by comparing, mapping and consolidating the results yielded 
from CNNs technique with those discovered by differential expression analysis method using DESeq2. These 
processes will be done by running our own in-house scripts. We expect that the annotated and putative IncRNAs 
could be identified and any possible predicted mRNAs will be discarded. 


2.7.2. Training datasets 

Datasets of human IncRNAs from two well-known public databases, which are GENCODE and LNCi- 
pedia, and human mRNAs datasets from the RefSeq, an NCBI RefSeq database will be used as the training 
datasets to discover and learn RNAs data patterns. As reported by [38], GENCODE had suggested that there 
are more than 16,000 IncRNAs in human genome. It has comprehensive gene annotation of IncRNA genes on 
the reference chromosomes. While LNCipedia currently contains 127,802 transcripts with 107,039 are consid- 
ered as high-confidence set of IncRNAs [39]. As for RefSeq, it contains curated, non-redundant collection of 
sequences representing genomes, transcripts and proteins [40]. 
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3. CONCLUSION 


One of the important research applications of RNA-seq these days is to discover the IncRNAs ex- 
pression profiles using computational tools and pipelines. There are studies to uncover IncRNAs differential 
expressions in patients with tuberculosis infections using microarray technology and traditional wet-lab ex- 
periments. However, RNA-seq is proven to come up with better estimates of transcript expressions. The 
sophisticated high-throughput RNA sequencing technology allows researchers to characterize and quantify dif- 
ferential expressions with higher sensitivity, higher speed and higher dinamic range. To further enhance and 
improve the prediction results, we propose a framework to discover IncRNA transcripts being expressed in 
human DCs of two TB resistant patient groups by incorporating CNNs classification technique with existing 
RNA-Seq expression analysis. 
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